April  1995 


UILU-ENG-95-2208 

CRHC-95-03 


Center  for  Reliable  and  High  Performance  Computing 

DTIC 

ELECTE 
MAYl  01995 

C 

Reducing  Cache  Misses  in  Numerical 
Applications  Using  Data  Relocation 
and  Prefetching 


Y.  Yamada,  T.  L.  Johnson,  G.  E.  Haab,  J.  C.  Gyllenhaal,  H-M.  Hwu 
&  J.  Torrellas 


19950508  123 

Coordinated  Science  Laboratory 
College  of  Engineering 

UNIVERSITY  OF  ILLINOIS  AT  URB ANA-CHAMPAIGN 


Approved  for  Public  Release.  Distribution  Unlimited. 


UNCLASSIFIED  _  _  _ 

SECURITY  CLASSIFICATION  OF  THIS  PACjE 


REPORT  DOCUMENTATION  PAGE 


la.  REPORT  SECURITY  CLASSIFICATION 

lb.  RESTRICTIVE  MARKINGS 

2a.  SECURITY  CLASSIFICATION  AUTHORITY 

3.  DISTRIBUTION /AVAILABILITY  OF  REPORT 

Approved  for  public  release; 
distribution  unlimited 

2b.  DECLASSIFICATION /DOWNGRADING  SCHEDULE 

4.  performing  organization  REPORT  NUMBER(S) 

5.  MONITORING  ORGANIZATION  REPORT  NUMBER(S) 

6a.  NAME  OF  PERFORMING  ORGANIZATION 
Coordinated  Science  Lab 
University  of  Illinois 

6b.  OFFICE  SYMBOL 
(If  applicable) 

N/A 

7a.  NAME  OF  MONITORING  ORGANIZATION  •‘•1  Mazda 

2)  NSF  3)  JSEP  4)  NASA 

6c  ADDRESS  (Q'ty,  State,  and  ZIP  Code) 

'  1308  W.  Main  St. 

Urbana,  IL  61801 

•  7b.  ADDRESS  (Crty,  State,  arK/ Z/P  Code)  1)  2-5  Morlya-Uho, 
Kanagawa-Ku.  Yokohama,  Kanagawa,  221  JAPAN 

2)  4201  Wilson  Blvdi,  Arlington,  VA  22230 

t]  gEi; 

8a.  NAME  OF  FUNDING /SPONSORING 
ORGANIZATION 

7a. 

8b.  OFFICE  SYMBOL 
(If  applicable) 

9.  PROCUREMENT  INSTRUMENT  IDENTIFICATION  NUMBER 

10.  SOURCE  OF  FUNDING  NUMBERS 

7b. 

PROGRAM  PROJECT  TMK 

ELEMENT  NO.  NO.  NO.  [ACCESSION  NO. 

jnii  (Include  Security  Claisification)  .  ^  ^  j 

Reducing  Cache  Misses  in  Numerical  Applications  Using  Data  Relocation  and  Prefetching 


12  PERSONAL  AUTHOR{S)  Y.  Yamada,  Teresa  L.  Johnson,  Grant  E.  Haab,  John  C.  Gyllenhaal,  Wen  mei 

W.  Hwu,  and  Josep  Torrellas  _ _ _ 


13a.  TYPE  OF  REPORT 

Technical 


13b.  TIME  COVERED 
FROM _ TO  , 


14.  DATE  OF  REPORT  (Year,  Month,  Day) 

April  1995  _ 


PAGE  COUNT 

28 


16.  SUPPLEMENTARY  NOTATION 


17. 


COSATl  CODES 


FIELD 


GROUP 


SUB-GROUP 


18.  SUBJECT  TERMS  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 
cache  conflicts,  data  copying,  data  relocation,  program 
optimization,  software  prefetching 


9.  ABSTRACT  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 


Numerical  applications  frequently  contain  nested  loops  that  process  large  arrays  of  data.  The 
execution  of  these  loop  structures  often  produces  memory  reference  patterns  that  utilize  data 
caches  poorly.  Indeed,  poor  reuse  of  the  data,  large  working  set  sizes,  and  frequent  non-unit 
stride  accesses  all  combine  to  cause  many  cache  misses.  To  improve  cache  performance,  data 
copying  has  been  proposed.  However,  this  technique  has  high  overhead. 


(OVER) 


20.  DISTRIBUTION /AVAILABILITY  OF  ABSTRACT 

S  UNCLASSIFIED/UNLIMITED  □  SAME  /VS  RPT.  □  DTIC  USERS 

21.  ABSTRACT  SECURITY  CUVSSIFICATION 

Unclassified 

22a.  NAME  OF  RESPONSIBLE  INDIVIDUAL 

22b.  TELEPHONE  prK/ude  Area  Code) 

22c.  OFFICE  SYMBOL 

DD  FORM  1473, 84  MAR 


All  other  editionj  are  obiolete. 


SECURITY  CLASSIFICATION  OF  THIS  PAGE 
UNCLASSIFIED 


UNCLASSIFIED _ 

fECUWITY  CLASilFICATlOK  OF  THIl  PAQl 
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In  this  paper,  instead,  we  propose  a  combined  hardware  and  software  technique  called  data  relo¬ 
cation  and  prefetching  which  eliminates  much  of  the  overhead  of  data  copying  through  the  use  of 
special  hardware.  Furthermore,  by  relocating  the  data  while  performing  software  prefetching,  the 
overhead  of  copying  the  data  can  be  reduced  further.  This  technique  performs  better  than  pre¬ 
fetching  alone  because  it  reduces  cache  misses  through  relocation,  and  it  reduces  overhead  by 
prefetching  multiple  elements  at  once.  The  hardware  is  designed  to  overlap  relocation  and  pre¬ 
fetching  with  normal  execution,  and  to  highly  utilize  the  available  bus  bandwidth.  Simulation 
results  show  that  this  technique  greatly  reduces  data  cache  miss  rates.  As  a  result,  largein  this 
paper,  instead,  we  propose  a  combined  hardware  and  software  technique  called  data  relocation 
and  prefetching  which  eliminates  much  of  the  overhead  of  data  copying  through  the  use  of  spe¬ 
cial  hardware.  Furthermore,  by  relocating  the  data  while  performing  software  prefetching,  the 
overhead  of  copying  the  data  can  be  reduced  further.  This  technique  performs  better  than  pre¬ 
fetching  alone  because  it  reduces  cache  misses  through  relocation,  and  it  reduces  overhead  by 
prefetching  multiple  elements  at  once.  The  hardware  is  designed  to  overlap  relocation  and  pre¬ 
fetching  with  normal  execution,  and  to  highly  utilize  the  available  bus  bandwidth.  Simulation 
results  show  that  this  technique  greatly  reduces  data  cache  miss  rates.  As  a  result,  large  applica¬ 
tions  including  PERFECT  and  SPEC  benchmarks  achieve  up  to  2.5  times  speedup.  The  hardware 
support  required  by  this  technique  has  been  greatly  refined  over  that  presented  in  an  earlier  paper. 
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Abstract 

Numerical  applications  frequently  contain  nested  loops  that  process  large  arrays  of  data.  The 
execution  of  these  loop  structures  often  produces  memory  reference  patterns  that  utilize  data 
caches  poorly.  Indeed,  poor  reuse  of  the  data,  large  working  set  sizes,  and  frequent  non-unit 
stride  accesses  all  combine  to  cause  many  cache  misses.  To  improve  cache  performance,  data 
copying  has  been  proposed.  However,  this  technique  has  high  overhead. 

In  this  paper,  instead,  we  propose  a  combined  hardware  and  software  technique  called  data 
relocation  and  prefetching  which  eliminates  much  of  the  overhead  of  data  copying  through  the 
use  of  special  hardware.  Furthermore,  by  relocating  the  data  while  performing  software  prefetch¬ 
ing,  the  overhead  of  copying  the  data  can  be  reduced  further.  This  technique  performs  better 
than  prefetching  alone  because  it  reduces  cache  misses  through  relocation,  and  it  reduces  over¬ 
head  by  prefetching  multiple  elements  at  once.  The  hardware  is  designed  to  overlap  relocation 
and  prefetching  with  normal  execution,  and  to  highly  utilize  the  available  bus  bandwidth.  Sim¬ 
ulation  results  show  that  this  technique  greatly  reduces  data  cache  miss  rates.  As  a  result,  large 
applications  including  PERFECT  and  SPEC  benchmarks  achieve  up  to  2.5  times  speedup.  The 
hardware  support  required  by  this  technique  has  been  greatly  refined  over  that  presented  in  an 
earlier  paper. 

Index  terms  -  Cache  conflicts,  data  copying,  data  relocation,  program  optimization,  software 
prefetching. 


1  Introduction 

Numerical  applications  frequently  contain  nested  loops  that  process  large  arrays  .  The  execution 
of  these  loop  structures  has  been  shown  to  produce  memory  reference  patterns  that  under-utilize 
data  caches  [1][2].  This  is  caused  by  at  least  three  problems.  First,  large  working  set  sizes  cause 
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cache  overflow,  resulting  in  cache  misses.  Secondly,  non-unit  stride  access  patterns  can  cause  low 
utilization  of  cache  lines,  which  increases  cache  conflicts,  in  addition  to  wasted  bus  and  memory 
cycles  [3].  Finally,  low  reuse  of  data  results  in  poor  cache  use,  and  therefore  more  cache  misses. 

Potentially,  one  could  use  a  larger  cache  size  and  higher  cache  associativity  to  reduce  some  of 
these  effects.  This  brute  force  approach,  however,  does  not  scale  well  with  larger  problem  sizes. 
Moreover,  it  would  result  in  significant  hardware  cost  and  increased  cache  access  latency,  both  of 
which  could  be  avoided  via  the  more  cost-effective  approach  proposed  in  this  paper. 

The  use  of  loop  blocking  transformations  is  often  effective  in  improving  performance  of 
caches  [2]  [4]  [5].  By  partitioning  the  iteration  space,  loop  blocking  transformations  reduce  the 
amount  of  data  referenced  between  two  references  to  the  same  datum,  thereby  increasing  the  po¬ 
tential  for  data  reuse.  In  practice,  however,  it  has  been  shown  [4]  that  loop  blocking  transformations 
suffer  from  cache  mapping  conflicts.  Additionally,  blocking  alone  is  not  effective  for  singly-nested 
loops  since  the  data  accesses  are  not  reordered.  Data  copying  has  been  proposed  to  reduce  the 
cache  conflict  misses  [1]  [4]  [6],  however  the  overhead  is  significant. 

Data  prefetching  has  also  been  proposed  to  reduce  cache  misses  by  fetching  data  into  the  cache 
before  it  is  referenced  [7]  [8].  When  used  in  conjunction  with  small  cache-block  sizes,  one  can 
potentially  eliminate  the  problem  of  low  utilization  of  cache  blocks  and  wasted  bus  cycles  [3]. 
However,  data  prefetching  may  increase  the  size  of  the  working  set,  introducing  capacity  misses. 
Also,  prefetched  data  may  conflict  with  the  current  working  set  in  the  cache,  introducing  more 
conflict  misses  [9]  [10].  In  order  for  data  prefetching  to  improve  performance  in  a  reliable  manner, 
one  must  ensure  that  both  current  and  future  working  sets  can  fit  into  the  cache. 

We  propose  an  approach,  called  Data  Relocation  and  Prefetching  {DRP),  that  prefetches  the 
array  element  read  references  into  consecutive,  and  therefore  non-conflicting,  locations  within  the 
cache.  This  is  implemented  with  combined  hardware  and  compiler  support  which  has  less  overhead 
than  the  traditional  data  copying  approach,  because  the  relocation  is  integrated  with  prefetching. 
Also,  compression  and  prefetching  of  the  next  working  set  is  overlapped  with  the  computation  for 
the  current  working  set  in  order  to  hide  the  latency  of  the  relocation.  With  this  technique,  if  the 
original  data  access  pattern  is  of  non-unit  stride,  unused  data  are  not  brought  into  the  cache  during 
compression  and  prefetch,  resulting  in  improved  cache-line  utilization. 

The  initial  version  of  this  technique  was  proposed  in  an  earlier  paper  [11],  which  focused  on 


2 


compiler  support  and  gave  a  rough  description  of  the  hardware  support.  Preliminary  results  were 
also  presented.  This  paper  presents  a  more  detailed  description  of  the  architectural  and  hardware 
support,  which  has  been  redesigned  for  improved  efficiency.  Improvements  include  greatly  reduced 
instruction  set  modifications  required,  a  much  wider  range  of  hardware  issues  that  are  discussed, 
and  even  more  encouraging  experimental  results. 

Overlap  of  program  execution  with  DRP,  high  utilization  of  the  memory  bus  bandwidth,  and 
support  of  out-of-order  return  of  requests  from  the  memory  system  are  important  objectives  the 
hardware  should  satisfy.  Four  of  the  main  components  of  the  hardware  unit  were  designed  to  satisfy 
these  objectives:  the  Precollect  Status  Store,  Instruction  Queue,  Outstanding  Memory  Request 
Store  and  Block  Assembly  Cache.  The  Precollect  Status  Store,  which  provides  synchronization 
between  the  DRP  unit  and  the  CPU,  and  the  Instruction  Queue,  which  allows  the  DRP  unit  to 
operate  asynchronously  with  the  CPUhy  buffering  DRP  requests,  combine  to  help  fully  overlap 
program  execution  and  DRP  instruction  execution.  Additionally,  the  Outstanding  Memory  Request 
Store,  which  allows  multiple  outstanding  memory  requests,  provides  high  utilization  of  the  memory 
bus,  and  the  Block  Assembly  Cache  handles  out-of-order  return  of  requests  from  the  memory 
system.  The  operation  of  each  component  will  be  discussed  in  detail  in  Section  2.4. 

Using  the  IMPACT  compiler  [12],  an  emulation  tool,  and  a  simulation  tool,  we  show  that  this 
extension  to  the  cache  and  processor  architecture  along  with  the  requisite  compiler  support  greatly 
improves  the  data  cache  performance  for  array-based  applications.  Cache  miss  rates  are  reduced 
for  all  the  applications  tested,  and  are  nearly  eliminated  for  several  applications.  Up  to  2.5  times 
speedup  is  achieved,  and  the  average  speedup  is  over  1.5. 

1.1  Related  Work 

The  technique  proposed  here  is  conceptually  similar  to  the  gather  operation  used  in  the  Cray- 
1  [13].  In  the  Cray-1,  the  array  elements  are  “gathered”  from  memory  into  the  vector  registers 
before  performing  vector  operations,  and  “scattered”  back  to  memory  after  the  vector  operations 
are  complete.  However,  the  hardware  necessary  to  support  data  relocation  and  prefetching  would 
be  much  easier  to  add  to  an  existing  processor  than  the  hardware  to  support  vectorization. 

Chen  and  Baer  [7]  presented  a  hardware  approach  which  preloads  blocks  for  accesses  with 
constant  strides.  However,  their  method  does  not  attempt  to  reduce  conflict  misses  because  it  does 
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not  change  the  layout  of  the  prefetched  data  within  the  cache.  On  the  other  hand,  our  technique 
relocates  only  data  elements  which  will  be  referenced,  and  maps  them  into  sequential  locations 
within  the  cache. 

Several  techniques  have  been  proposed  to  help  reduce  cache  conflict  misses,  and  can  be  used 
in  conjunction  with  DRP.  The  victim  cache  [14],  the  column-associative  cache  [15],  and  the  assist 
cache  [16]  can  be  used  with  DRP  to  reduce  conflict  misses  for  untransformed  accesses.  These 
strategies  avoid  some  of  the  cost  of  an  associative  cache  by  providing  a  fast  access  for  cache  hits 
and  a  slightly  slower  path  for  references  that  conflict  in  a  direct-mapped  cache.  Additionally,  the 
Cache  Miss  Lookaside  buffer  [17]  can  be  used  to  reduce  conflict  misses  in  a  large  second-level  cache, 
while  DRP  is  used  for  first-level  caches. 

The  remainder  of  this  paper  is  organized  into  four  sections.  Section  2  presents  the  proposed 
technique  and  describes  the  necessary  architecture,  compiler,  and  hardware  support.  In  Section  3, 
the  simulation  environment  is  detailed,  and  in  Section  4  simulation-based  experimental  results 
are  provided  to  demonstrate  the  effectiveness  of  the  proposed  technique.  Finally,  Section  5  offers 
concluding  remarks  and  future  directions. 

2  Data  Relocation  and  Prefetching 

2.1  Overview 

We  propose  a  hardware-based,  compiler-supported  technique  called  data  relocation  and  prefetching 
in  order  to  improve  the  data  cache  performance  .  In  this  method,  the  array  elements  read  in  the 
inner  loop  of  a  nest  are  sequentially  mapped  into  special  relocation  buffers  within  the  cache  before 
they  are  accessed.  Special  hardware  that  is  attached  to  the  cache  unit  performs  this  relocation  of 
data  into  sequential  locations  while  prefetching  the  data  from  the  memory  to  the  cache,  without 
stalling  the  CPf/.  The  relocation  operations  are  invoked  by  an  explicit  precollect  instruction  inserted 
by  the  compiler.  The  compiler  also  inserts  a  declaration  into  the  original  code  to  allocate  a  relocation 
buffer  in  memory  for  the  relocated  data. 

Because  the  array  data  is  relocated,  the  prefetch  is  binding.  During  the  computation,  the  newly 
assigned  address  in  the  relocation  buffer  is  used  to  access  the  data  rather  than  the  original  address. 
Consequently,  the  relocation  must  be  completed  before  the  computation  on  the  same  data  begins. 
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a)  Original  Code 


b) Transformed  Code 


for  (i  =  0;  i  <  N;  i++) 
for  (j  =  0;  j  <  3;  j++) 

. . .  =  A[i]  C2*j]  +  B[j]  [i] ; 


for  (i  =  0;  i  <  N;  i++) 
for  (j  =  0;  j  <  3;  j++) 
...  =  A' [j]  +  B' [j] ; 


Layout  in  memory 


Relocation  Buffer 


Layout  in  cache 


Figure  1:  Concept  of  Data  Relocation 


So  if  the  relocated,  cached  data  is  replaced  by  some  other  data,  the  relocated  data  is  automatically 
written  back  to  the  relocation  buffer  in  memory  since  the  accesses  in  the  computation  use'  the 
address  of  the  relocated  data.  To  insure  that  this  write-back  occurs,  the  dirty  bit  is  set  when  the 
cache  line  for  the  data  is  allocated.  Modifications  to  this  algorithm  necessary  for  the  use  of  a  write- 
through  cache,  instead  of  a  write-back  cache,  will  be  discussed  in  Section  2.4.  Array  references  that 
are  written  in  the  loop  are  never  transformed,  as  will  be  discussed  in  Section  2.2.2,  so  the  relocated 
data  will  never  need  to  be  written  back  to  the  original  address. 

DRP  can  improve  the  spatial  locality  of  array  accesses  for  a  loop  nest.  Figure  1  shows  how 
array  data  elements  accessed  in  the  first  iteration  of  the  outer  loop  are  copied  to  sequential  cache 
locations  that  map  to  the  relocation  buffer  in  memory.  Array  A  is  accessed  with  a  stride  of  two, 
and  array  B  is  accessed  in  column  order  during  the  execution  of  the  inner  loop.  Accesses  to  these 
array  elements  can  result  in  poor  performance  because; 

1.  The  accesses  may  have  low  spatial  locality  because  of  the  non- unit  access  stride,  resulting  in 
wasted  cache  capacity  which  may  lower  the  cache  hit  rate. 

2.  The  sets  of  accesses  for  different  arrays  may  conflict  with  each  other  because  they  may  be 
mapped  to  some  of  the  same  locations  in  the  cache. 

3.  The  accesses  for  a  single  array  may  conflict  with  each  other  because  of  a  large  access  stride. 
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If  the  accessed  elements  of  these  arrays  are  relocated  into  the  cache,  spatial  locality  can  be  improved 
by  packing  elements  of  the  arrays  into  contiguous  locations.  This  increased  spatial  locality  can 
eliminate  conflicts  between  elements  of  the  same  array,  as  well  as  between  those  of  different  arrays. 
Also,  since  only  necessary  elements  are  brought  into  the  cache,  the  extra  memory  requests  and  time 
to  fill  the  cache  line  due  to  the  non-unit  stride  accesses  are  reduced.  Furthermore,  if  the  total  size 
of  the  relocated  array  elements  is  smaller  than  the  cache  size,  the  compression  guarantees  that  the 
references  to  the  relocated  data  do  not  conflict  with  each  other  in  the  cache.  Finally,  cache  space 
is  conserved  by  packing  elements  of  the  arrays. 

In  order  to  reduce  the  instruction-fetch  overhead  due  to  the  inserted  precollect  instructions,  each 
precollect  instruction  contains  enough  information  to  operate  on  several  elements  of  the  array  in 
sequence.  Also,  in  order  to  accommodate  the  latency  of  the  relocation  of  array  data,  the  relocation 
and  computation  phases  are  separated  in  time  by  software-pipelining  the  outer  loop. 

2.2  Architectural  Support 

Implementing  the  mechanism  for  the  DRP  technique  requires  an  extra  instruction  as  well  as  extra 
hardware.  In  the  following,  we  first  describe  the  new  instruction,  and  then  discuss  instructions 
eliminated  from  the  original  design  [11]. 

2.2.1  Precollect  Instruction 

To  support  DRP,  a  new  precollect  instruction  is  added  to  the  instruction  set  of  the  processor.  This 
instruction  compacts  the  array  data  referenced  in  a  computation  into  consecutive  memory  locations 
in  the  cache  before  the  data  is  needed  for  the  computation.  These  locations  together  are  called  a 
relocation  bujfer.  The  precollect  instruction  also  sets  the  dirty  bit  of  the  cache  line  so  that  if  the 
relocated  data  is  displaced  from  the  cache,  it  is  backed  up  in  a  relocation  buffer  in  memory.  The 
precollect  instruction  accesses  the  cache  to  find  the  source  data  first.  If  it  misses  then  a  request  is 
sent  to  the  memory  system  for  the  data.  In  either  case,  the  processor  does  not  stall  from  cache 
accesses  caused  by  a  precollect  instruction. 

The  precollect  instruction  has  five  operands.  The  first  operand  is  the  address  of  the  first  element 
of  the  array  to  be  relocated.  The  second  operand  is  the  address  of  the  first  element  of  the  relocated 
array  in  the  relocation  buffer.  Finally,  the  third  through  the  fifth  operands  are  the  size  of  each 
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array  element  in  bytes,  the  stride  of  the  array  accesses  in  bytes,  and  the  number  of  array  elements 
to  be  collected,  respectively. 

Since  most  RISC  architectures  do  not  allow  five  operands,  the  precollect  instruction  is  actually 
implemented  with  two  machine  instructions.  The  first  machine  instruction  is  an  immediate  load, 
which  encodes  the  element  size,  stride  and  number  of  data  elements  in  a  register.  Then  the  new 
machine  instruction  called  precollect  has  three  register  operands;  the  array  address,  the  buffer 
address,  and  the  encoded  register.  If  the  instruction  set  architecture  only  supports  two  source 
register  operands,  then  the  buffer  address  may  be  specified  as  an  immediate  value,  since  it  requires 
the  fewest  bits  of  the  three  operands.  Because  the  data  element  size,  stride,  and  number  of  data 
elements  are  likely  to  be  loop-invariant,  the  immediate  load  is  usually  moved  out  of  the  outer  loop 
by  our  compiler.  Therefore,  the  overhead  of  this  instruction  is  minimal. 

2.2.2  Instructions  Eliminated  from  Previous  Work 

The  previous  version  of  this  technique  that  we  proposed  [11]  required  five  extra  instructions,  not 
only  precollect,  but  also  await,  preallocate,  distribute  and  finishup.  The  await  was  used  as  a  synchro¬ 
nization  mechanism  to  avoid  accessing  the  relocated  array  data  before  the  precollect  was  completed. 
We  were  able  to  remove  this  extra  instruction  by  adding  hardware  to  enforce  this  constraint,  as 
will  be  explained  in  Section  2.4. 

For  data  that  is  written  only,  a  write-no-allocate  cache  is  used  so  that  these  references  are  simply 
sent  to  the  write  buffer  if  cache  misses  occur,  thereby  avoiding  cache  conflicts.  The  preallocate, 
distribute  and  finishup  instructions  were  used  to  support  relocation  and  prefetching  of  data  that  is 
written  and  later  read  within  the  inner  loop.  Experimental  evidence  suggests  that  this  support  is 
unwarranted,  for  only  two  loop  nests  in  all  of  the  applications  that  we  tested  contained  references 
of  this  nature  that  were  being  transformed.  When  we  did  not  transform  these  references,  less  than 
a  1%  degradation  in  performance  occurred.  The  fact  that  few  loads  follow  stores  to  the  same 
array  within  the  same  inner-most  loop  in  our  applications,  coupled  with  the  use  of  a  write-no- 
allocate  cache  as  described  above,  results  in  this  small  performance  degradation.  Therefore,  we 
have  eliminated  the  preallocate,  distribute  and  finishup  instructions. 
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2.3  Compiler  Support 


We  apply  the  data  relocation  and  prefetching  optimization  to  loops  nested  two  or  more  deep.  For 
loop  nests  which  are  nested  more  deeply  than  two,  the  transformation  is  applied  to  the  two  inner¬ 
most  loops.  In  this  case,  we  refer  to  the  outer  of  the  two  inner-most  loops  as  the  outer  loop,  and 
the  inner-most  loop  as  the  inner  loop.  Singly  nested  loops  are  transformed  into  doubly-nested  loops 
via  loop  strip-mining,  discussed  in  Section  A.l,  so  that  relocation  and  prefetching  can  be  applied 
to  them. 

Before  each  execution  of  the  inner  loop,  the  array  data  accessed  in  the  inner  loop  are  relocated 
and  prefetched.  In  order  to  allow  sufficient  time  for  precollect  instructions  to  complete,  we  overlap 
the  execution  of  one  outer  loop  iteration  with  the  precollection  of  data  used  in  the  next  outer  loop 
iteration.  Therefore,  two  relocation  buffers  are  required  for  each  array  we  precollect  (prefetch  and 
relocate),  one  for  the  executing  iteration  and  one  for  the  next  iteration  which  is  precollecting. 

The  code  transformations  employed  for  data  relocation  and  prefetching  are  explained  in  detail 
in  [18].  The  compiler  algorithms  have  been  improved  over  those  in  previous  work  [11]  so  that  the 
DRP  technique  can  be  applied  more  broadly  and  effectively.  The  most  important  of  the  compiler 
transformations  are  explained  in  Appendix  A,  and  an  example  is  given. 

2.4  Hardware  Support 

The  hardware  support  required  for  DRP  is  an  on-chip  module  called  the  Data  Relocation  and 
Prefetching  unit.  The  DRP  unit  receives  precollect  instructions  from  the  processor.  It  fetches  data 
from  the  cache  or  main  memory,  one  element  at  a  time,  and  then  assembles  each  cache  block  in 
an  assembly  buffer  and  writes  it  into  the  cache.  The  DRP  unit  needs  to  access  both  the  bus  and 
the  data  cache.  Therefore,  the  DRP  unit  shares  the  cache  with  the  CPU  and  shares  the  bus  with 
the  cache  (Figure  2).  In  this  configuration,  both  the  processor  and  the  cache  have  higher  priority 
than  the  DRP  unit  to  access  shared  resources.  This  priority  hierarchy  helps  to  ensure  that  the 
DRP  unit  does  not  interfere  with  the  processor  accesses  to  the  memory  system  while  a  precollect 
instruction  is  being  executed.  We  verified  via  simulation  that  the  DRP  unit  should  have  lower 
priority  in  accessing  the  bus  [18]. 

In  our  current  implementation,  we  use  a  virtually-addressed  write-back  data  cache.  However,  a 
physically-addressed  cache  can  also  be  used,  as  discussed  in  Section  2.4.6.  If  a  write-through  cache 
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Figure  2:  Data  Relocation  and  Prefetch  Unit  Interfaces 

is  used  instead,  when  the  precollected  data  is  written  to  the  data  cache,  it  must  also  be  placed  in 
the  write  buffer.  This  is  to  ensure  that  the  relocated  data  is  also  initially  placed  in  the  relocation 
buffer  in  memory,  because  there  is  no  provision  to  ensure  that  the  relocated  data  will  be  written 
back  to  memory  when  it  is  replaced  from  a  write-through  cache.  A  dedicated  read  port  into  the 
data  cache  is  required  by  the  DRP  unit.  It  is  used  to  check  if  data  to  be  precollected  resides  in  the 
cache.  This  read  port  should  not  trigger  a  cache  miss  if  the  data  is  not  found  in  the  cache:  If  the 
data  is  not  found  in  the  cache,  the  DRP  unit  accesses  memory  using  its  connection  to  the  memory 
bus.  Furthermore,  the  DRP  unit  itself  does  not  block  when  fetching  data  from  a  memory  location. 
If  a  high-bandwidth  memory  system  like  a  split- transaction  bus  system  is  used,  many  read  and 
write  requests  issued  by  the  DRP  unit  can  be  in  progress  at  the  same  time. 

The  structure  of  the  DRP  unit  is  shown  in  Figure  3.  It  has  five  main  parts:  Precollect  Status 
Store,  Instruction  Queue,  Address  Generator,  Outstanding  Memory  Request  Store,  and  DRP  Block 
Assembly  Cache.  As  discussed  in  Section  1,  the  Precollect  Status  Store  and  Instruction  Queue 
satisfy  the  objective  of  overlapping  program  execution  with  DRP,  the  Outstanding  Memory  Request 
Store  helps  fully  utilize  the  memory  bus  bandwidth,  and  Block  Assembly  Cache  supports  out-of- 
order  return  of  data  from  memory.  In  the  following,  we  consider  each  component  in  turn. 


9 


Prccolicct  Instructions 


Data  Relocation 
and  Prefetch  Unit 


□RP  state 
information 
(to  processor) 


Input  1 

2 

3  4 

ports; 

In.struction  Queue  (IQ) 

Array  Addr 

1  Sutter  Addr  |  Stride  |  Size  |  « Oala 

A-i - 

input  I 

ports:  t  2  3  4 


t  t  T  T  input 

I  Butter  Addr  Status  I  ports: 


Prccnilect 

Status 

search 

Store 

(PSS) 

Phys  Acfdr  |  Deal  Addr  |  Size 


Outstanding  Memory  search 
Request  Store  (OM(^) 


size  addr 

data 

writet 

size 

read 

addr  write2 

addr 

data 

data 

DRP  Block  Assembly  Cache  (BAD 
Precollect  Data 

Bus  Interface 


Data 

Cache 


memory  load  request 
virtual  addresses 
(from  processor) 


to  additional 
read  port 


from  additional 
read  port 


-  to  cache  refill 
*  write  port 


btad  Assembly  Butfers 


Figure  3:  Data  Relocation  and  Prefetch  Unit  Data  Path 


2.4.1  Precollect  Status  Store 

When  a  precollect  instruction  is  executed,  we  allocate  an  entry  in  the  Precollect  Status  Store  (PSS). 
Each  PSS  entry  has  a  bit  for  each  cache  block  of  the  relocation  buffer  into  which  we  are  precollecting. 
All  the  bits  are  initialized  to  zero  when  the  entry  is  first  allocated.  As  each  cache  block  of  relocated 
data  is  finally  copied  into  the  data  cache,  the  corresponding  bit  in  the  PSS  is  set.  Once  all  bits 
for  an  entry  are  set,  the  precollect  instruction  is  complete,  and  we  remove  the  entry  from  the  PSS. 
Note  that  more  than  one  precollect  may  be  in  progress  if  the  memory  can  return  data  out-of-order. 

The  cache  controller  must  check  the  PSS  each  time  the  cache  is  accessed  by  the  processor  to 
ensure  that  the  block  it  is  attempting  to  access  has  been  precollected.  If  the  corresponding  bit 
is  zero  then  the  cache  is  blocked  until  the  bit  for  that  block  is  set.  If  an  entry  is  not  found  in 
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the  PSS  for  the  block  being  accessed,  the  cache  can  proceed  because  either  the  corresponding 
precollect,  instruction  is  completed,  or  the  data  being  accessed  was  not  transformed  for  relocation. 
As  discussed  in  Section  1,  the  PSS  contributes  to  the  desired  overlap  of  program  execution  with 
DRP  by  providing  a  mechanism  to  synchronize  the  CPU  and  the  DRP  unit  when  necessary. 

Entries  in  the  PSS  must  be  allocated  immediately  upon  the  issue  of  a  precollect  instruction  so 
that  an  access  to  the  precollected  data  issued  the  immediately  following  cycle  will  correctly  cause 
the  cache  to  block.  Therefore,  the  number  of  write  ports  into  the  PSS  should  equal  the  number  of 
precollects  allowed  to  issue  in  the  same  cycle. 

The  PSS  is  fully-associative  and  is  accessed  from  one  of  three  sources:  the  data  cache,  the 
Block  Assembly  Cache  (BAG)  (discussed  in  Section  2.4.5),  and  the  Address  Generator  (discussed 
in  Section  2.4.3).  There  should  be  a  dedicated  port  into  the  PSS  from  the  data  cache  for  each  of 
the  memory  load  requests  that  the  processor  can  issue  simultaneously.  The  other  two  requests  are 
multiplexed  into  one  port,  as  shown  in  Figure  3.  All  of  these  ports  perform  an  associative  search 
on  the  buffer  address.  Priority  to  access  the  shared  port  is  given  to  the  Address  Generator,  since 
stalls  may  impact  performance  significantly.  The  algorithm  fiow  chart  for  the  PSS  access  port  is 
illustrated  in  Figure  4. 

2.4.2  Instruction  Queue 

Once  the  precollect  instruction  is  decoded,  the  operands  are  fetched  from  the  register  file  and  placed 
in  the  Instruction  Queue  (IQ).  As  with  the  PSS,  the  IQ  must  have  as  many  write  ports  as  the  number 
of  precollects  allowed  in  the  same  cycle.  In  addition,  to  avoid  stalling  the  processor,  both  the  PSS 
and  the  IQ  need  enough  entries  to  hold  the  maximum  number  of  “in-flight”  precollects.  If  either 
the  IQ  or  the  PSS  is  full  when  a  new  precollect  is  sent  from  the  processor,  the  DRP  unit  stalls  the 
processor  until  there  is  an  empty  entry.  By  buffering  precollect  instructions  as  they  are  received 
from  the  CPU,  the  IQ  allows  asynchronous  operation  of  the  CPU  and  DRP  unit. 

An  entry  is  dequeued  from  the  IQ  as  soon  as  it  is  latched  into  the  Address  Generator,  discussed 
in  Section  2.4.3. 
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Figure  4:  Algorithm  Flow  Chart  for  the  Precollect  Status  Store  Access  Port 

2.4.3  Address  Generator 

The  Address  Generator  simultaneously  calculates,  for  each  array  element,  its  original  address  and 
its  destination  address.  To  do  this  it  uses  the  starting  array  address,  stride,  starting  relocation 
buffer  address  and  element  size.  The  compiler  forces  all  relocation  buffers  to  align  to  virtual  page 
boundaries,  and  therefore,  memory  block  boundaries.  Consequently,  the  lower  bits  of  the  relocation 
buffer  addresses  are  zero  and  we  can  simply  append  the  computed  element  offset  to  the  high-order 
bits  of  the  relocation  buffer  address.  Also,  the  size  field  is  encoded,  since  the  data  element  size  (in 
bytes)  can  be  any  power  of  two  up  to  the  bus  transfer  width.  Step  5  of  Figure  5  corresponds  to 
the  generation  of  these  addresses,  as  described  above.  Additionally,  a  counter  is  used  to  determine 
how  many  element  addresses  remain  to  be  generated. 

After  the  source  and  destination  addresses  are  generated,  but  before  attempting  to  access  the 
corresponding  data,  an  entry  must  exist  in  the  BAC,  described  in  Section  2.4.5.  The  BAC  contains 
the  logic  to  align  the  requested  element  within  the  received  block  of  data  into  the  corresponding 
relocation  buffer  cache  block.  If  the  element  to  be  precollected  will  reside  at  the  start  of  a  relocation 
buffer  cache  block,  an  entry  must  be  allocated  in  the  BAC.  To  determine  if  this  element  is  the  start 
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Figure  5:  Algorithm  Flow  Chart  for  the  Address  Generator 


of  a  new  cache  block,  the  Address  Generator  checks  if  the  destination  address  low  order  bits  are 
zero,  and  if  so,  allocates  an  entry  in  the  BAG  once  there  is  space.  Furthermore,  a  counter  associated 
with  each  BAG  entry  is  set  to  the  number  of  precollected  elements  that  will  reside  in  that  cache 
block,  which  is  determined  using  the  element  size,  the  block  size  and  the  Address  Generator  counter 
value.  Steps  7  through  10  of  Figure  5  correspond  to  these  actions. 

Next,  a  read  request  is  sent  to  the  cache.  If  data  is  present  in  the  cache,  it  is  written  to  the  BAG, 
using  the  destination  address  and  size.  These  actions  are  shown  in  Steps  11  and  12  of  Figure  5. 

If  the  data  is  not  present  in  the  cache,  the  cache  does  not  send  a  read  request  to  fetch  the 
data.  Instead,  the  DRP  unit  sends  the  read  request  to  memory  using  the  source  address.  Step  16 
of  Figure  5  corresponds  to  this  action.  Before  sending  the  request,  the  address  is  translated  by  the 
TLB.  Because  the  DRP  unit  needs  to  access  the  TLB  for  every  element,  a  dedicated  port  into  the 
TLB  is  required.  The  destination  address  and  size  are  sent  to  the  Outstanding  Memory  Request 
Store  (OMRS),  discussed  in  Section  2.4.4,  for  later  use.  Similarly,  the  physical  address  is  sent  from 
the  TLB  to  the  OMRS.  Steps  14  and  15  of  Figure  5  correspond  to  these  actions. 

Because  we  may  not  be  precollecting  into  the  maximum  number  of  blocks  allowed  per  relocation 
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Figure  6:  Algorithm  Flow  Chart  of  the  Outstanding  Memory  Request  Store  Logic 

buffer,  some  status  bits  in  the  PSS  may  be  unused  and  will  never  be  set  by  the  data  returning  from 
the  precollect.  Since  an  entry  is  removed  from  the  PSS  only  when  all  of  its  status  bits  are  set,  these 
unused  bits  must  somehow  be  set.  Therefore,  when  the  Address  Generator  counter  reaches  zero  all 
of  the  addresses  have  been  generated,  so  the  bits  corresponding  to  unused  blocks  in  the  relocation 
buffer  are  set.  Steps  4  and  20  of  Figure  5  correspond  to  these  actions. 

Each  precollect  instruction  is  processed  completely  by  the  Address  Generator  before  the  next 
may  proceed.  When  the  Address  Generator’s  counter  has  reached  zero,  requests  have  been  gener¬ 
ated  for  every  element,  and  the  next  precollect  instruction  may  be  latched  from  the  IQ. 

2.4.4  Outstanding  Memory  Request  Store 

As  mentioned  in  Section  2.4.3,  when  a  request  is  sent  to  the  memory  system  by  the  Address 
Generator,  the  physical  address,  destination  address  and  size  are  saved  in  the  Outstanding  Memory 
Request  Store.  The  algorithm  flow  chart  of  the  OMRS’s  logic  is  shown  in  Figure  6.  Each  entry  in 
the  OMRS  corresponds  to  one  outstanding  request. 

If  the  OMRS  has  enough  entries,  the  store  does  not  block  while  it  waits  for  a  memory  access 
to  complete.  In  order  to  minimize  the  probability  that  the  OMRS  blocks,  the  store  should  have  as 
many  entries  a.s  the  maximum  memory  latency  divided  by  the  Address  Generator  cycle  time.  By 
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keeping  track  of  multiple  outstanding  requests,  the  OMRS  exploits  available  bus  bandwidth,  one 
of  the  objectives  discussed  in  Section  1. 

When  the  data  returns  from  memory,  the  physical  address  fields  of  the  OMRS  are  searched 
associatively  using  the  physical  address  from  memory  to  find  the  corresponding  entry.  The  re¬ 
quested  data  is  then  written  to  the  BAG  using  the  destination  address  and  size  from  the  OMRS. 
Additionally,  that  entry  is  removed  from  the  OMRS.  These  last  three  actions  are  represented  by 
Steps  2,  3  and  4,  respectively,  of  Figure  6. 

2.4.5  DRP  Block  Assembly  Cache 

In  order  to  handle  out-of-order  return  of  data  from  the  memory  system,  we  use  a  small  fully- 
associative  cache  in  the  DRP  unit,  called  the  DRP  Block  Assembly  Cache,  to  align  and  assemble 
the  precollected  elements  into  cache  blocks.  Each  entry  of  the  BAG  holds  a  cache  block  that  is  a 
portion  of  a  relocation  buffer.  The  assembly  of  each  destination  cache  block  is  performed  in  the 
BAG  instead  of  in  the  data  cache,  to  minimize  the  interference  with  regular  cache  operation.  The 
BAG  also  eliminates  the  need  for  a  dedicated  write  port  into  the  cache,  because  the  cache  refill 
write  port  used  by  memory  to  write  load  miss  data  can  be  shared  with  the  DRP  unit.  After  an 
entire  block  is  assembled,  it  is  written  into  the  cache.  Then,  the  corresponding  bit  in  the  PSS  is 
set  and  the  block  is  removed  from  the  BAG.  For  each  block  in  the  BAG,  a  counter  keeps  track  of 
how  many  elements  have  yet  to  be  received.  This  counter  is  initially  set  by  the  Address  Generator, 
as  described  in  Section  2.4.3.  Gonsequently,  the  BAG  knows  when  the  block  is  fully  assembled. 

Because  the  BAG  is  small,  there  may  not  be  enough  entries  and  the  OMRS  may  have  to  stall. 
To  avoid  stalls,  the  BAG  should  be  fully- associative  and  have  enough  entries  to  hold  the  maximum 
number  of  blocks  “in-flight” .  The  number  of  blocks  required  depends  on  the  memory  latency,  bus 
bandwidth  and  the  average  element  size.  For  our  configuration,  which  has  a  10-cycle  memory 
latency,  five  entries  are  enough.  Note  that  the  DRP  unit  can  be  easily  interfaced  with  a  banked 
memory  system  since  the  BAG  handles  out-of-order  data  return. 

2.4.6  Virtual  Versus  Physical  Cache  Considerations 

Our  implementation  thus  far  uses  a  virtual  data  cache.  Therefore,  we  attach  the  process  identifier 
to  the  data  cache  tags  so  that  the  data  cache  need  not  be  drained  when  a  context  switch  occurs. 
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For  the  same  reason,  entries  in  the  IQ,  PSS,  OMRS  and  BAG  also  need  to  have  process  identifiers 
attached.  Otherwise,  we  must  drain  the  DRP  unit  completely  before  performing  a  context  switch. 
This  approach,  of  course,  would  result  in  prolonged  context  switching  latencies.  Assuming  that  we 
attach  process  identifiers,  then  the  actions  taJcen  in  response  to  a  context  switch  depend  on  the 
cause  of  the  context  switch.  If  the  context  switch  was  caused  by  anything  other  than  a  page  fault, 
no  actions  need  be  taken,  and  the  DRP  unit  can  continue  normal  execution. 

If  a  context  switch  is  due  to  a  page  fault  triggered  by  a  precollect  instruction,  then  a  small  part 
of  the  DRP  state  must  be  saved  to  the  processor,  since  the  DRP  unit  is  in  the  process  of  executing 
the  faulting  precollect  instruction  (Step  17  of  Figure  5).  After  the  page  fault  has  been  handled 
and  execution  of  the  faulting  process  resumes,  the  saved  state  is  used  to  restart  the  precollect 
instruction  at  a  point  where  the  fault  occurred.  Precollect  instructions  in  the  IQ  which  are  issued 
after  the  faulting  precollect  instruction  may  begin  execution  immediately  after  the  context  switch. 
In  order  to  simplify  the  implementation,  immediately  before  the  faulting  process  resumes  execution, 
the  operating  system  emulates  the  execution  of  the  faulting  precollect  instruction  using  ordinary 
processor  load  and  store  instructions. 

At  the  point  of  page  fault  detection  for  a  precollect  instruction,  we  save  the  precollect’s  state 
information  from  the  Address  Generator.  As  shown  in  Figure  3,  the  state  information  saved  consists 
of  the  source  address  that  faulted,  the  number  of  remaining  elements  (from  the  counter),  the 
destination  address,  stride,  and  size.  To  avoid  having  a  single  partially  filled  block  remain  in  the 
BAG,  the  destination  address  of  the  element  that  caused  the  fault,  the  element  size  and  the  block 
size  are  used  to  set  the  counter  of  that  BAG  entry  to  the  number  of  elements  precollected  before  the 
page  fault  (Step  18  of  Figure  5).  Therefore,  after  the  OMRS  drains,  this  partially  completed  block 
of  relocated  data  will  be  written  into  the  data  cache.  Finally,  the  entry  in  the  PSS  corresponding 
to  the  faulting  precollect  instruction  is  removed  using  the  destination  address  from  the  Address 
Generator  (Step  19  of  Figure  5,  Steps  6  and  10  of  Figure  4). 

If  the  data  cache  is  physically  addressed,  we  assume  that  address  translation  occurs  in  parallel 
with  access  of  the  cache  tag  store  using  the  virtual  address.  The  BAG  is  also  physically  addressed 
using  the  same  assumptions.  In  Figure  3,  the  read  and  write  ports  for  the  data  cache  and  the  write 
ports  for  the  BAG  use  virtual  addresses  since  the  address  translation  is  initiated  by  the  caches. 
The  DRP  unit  should  still  use  virtual  addresses,  since  the  PSS,  IQ,  Address  Generator  and  OMRS 
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use  the  virtual  addresses  provided  by  the  precollect  instruction. 

For  a  physically  addressed  cache,  on  any  type  of  a  context  switch,  the  same  process  described 
above  for  a  virtually  addressed  cache  must  be  performed,  but  the  entire  IQ  and  PSS  must  also 
be  saved  as  part  of  the  state.  However,  in  this  case,  the  DRP  state  is  restored  in  hardware  when 
the  faulting  process  resumes  execution,  rather  than  performing  software  emulation,  since  software 
emulation  will  not  simplify  hardware. 

2.4.7  Relocation  Buffer  and  DRP  Component  Size  Considerations 

Another  issue  that  needs  to  be  considered  is  the  relocation  buffer  size.  The  buffer  space  should  be 
the  same  size  or  smaller  than  the  data  cache  so  that  there  are  no  cache  conflicts  between  relocation 
buffers.  Within  this  area,  we  will  allocate  several  relocation  buffers  corresponding  to  different 
arrays  in  the  loop  nest.  The  relocation  buffer  size  imposes  a  maximum  on  the  number  of  arrays  per 
loop  nest  which  may  be  transformed.  For  our  experiments,  we  used  an  8K-byte  cache  and  128-byte 
relocation  buffers,  which  allowed  us  64  relocation  buffers  per  loop  nest.  However,  as  explained  in 
Section  2.3,  two  relocation  buffers  are  needed  per  array  to  accommodate  software  pipelining,  so  we 
can  transform  32  arrays  per  loop  nest.  Although  64  entries  is  the  maximum  needed  for  both  the 
IQ  and  PSS,  software  pipelining  allows  us  to  use  32  entries  for  each  with  only  a  small  possibility 
of  stalling. 

If  more  than  128  bytes  of  array  data  are  accessed  in  the  inner  loop,  multiple  precollects  can 
be  executed.  The  compiler  aligns  the  corresponding  relocation  buffers  sequentially  in  memory  so 
that  they  appear  as  one  large  relocation  buffer.  Therefore,  the  size  of  the  relocation  buffer  does 
not  inherently  limit  the  amount  of  data  one  can  precollect  for  the  inner  loop  execution. 

3  Experimental  Environment 

In  this  section,  the  environment  used  for  experimental  evaluation  of  the  DRP  technique  is  presented 
.  The  applications  for  this  study  consist  of  seven  numeric  benchmark  programs;  ARC2D,  ADM, 
EDNA,  and  OCEAN  ixom  PERFECT  [19],  MATRIX300  from  SPEC’89,  and  NASA?  smd  TOM- 
CATV  from  SPEC’92.  The  experimental  environment  includes  the  compiler  support,  emulation 
to  verify  transformation  correctness,  and  the  simulation  techniques  used  to  generate  experimental 
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Function 

Latency 

Function 

Latency 

Int  ALU 

1 

FP  ALU 

2 

memory  load 

2 

FP  multiply 

2 

memory  store 

1 

FP  divide  (single  prec.) 

8 

branch 

1  +  1  slot 

FP  divide  (double  prec.) 

15 

Table  1:  Instruction  latencies  for  simulation  experiments. 


results. 

3.1  Compiler  Support 

In  the  high-level  IMPACT  compiler  phases,  the  applications  were  profiled  at  the  loop- level  to  obtain 
the  number  of  invocations  and  the  number  of  iterations  for  all  loops  in  order  to  apply  the  DRP 
transformations  selectively  and  effectively.  Data  dependence  analysis  is  performed  using  the  Omega 
test  [20]  [21]  to  exclude  inner  loops  with  specific  cross-iteration  dependences  which  can  prevent  the 
necessary  transformations. 

In  order  to  provide  a  realistic  evaluation  of  the  DRP  technique,  we  first  optimized  the  code 
using  the  machine-specific  phases  of  IMPACT  compiler.  Classical  optimizations  were  applied,  then 
optimizations  were  performed  which  increase  instruction  level  parallelism  such  as  loop  unrolling 
and  superblock  formation  [22].  The  code  was  scheduled,  register  allocated,  and  optimized  for  a 
four-issue,  scoreboarded,  superscalar  processor  with  register  renaming.  Each  of  the  four  functional 
units  are  pipelined  and  can  execute  any  type  of  instruction.  The  register  file  contains  64  integer 
registers  and  64  double-precision  floating-point  registers. 

3.2  Transformation  Correctness  Verification  via  Emulation 

To  verify  the  correctness  of  the  code  transformations,  emulation  of  the  generated  code  was  per¬ 
formed  on  a  Hewlett-Packard  PA-RISC  7100  workstation.  The  precollect  instruction  is  emulated 
using  a  machine  language  subroutine  that  performs  the  data  relocation  from  memory  to  memory 
instead  of  to  and  from  the  cache.  Thus,  the  transformed  code  must  relocate  the  data  and  reference 
it  using  the  correct  addresses  for  the  emulation  to  produce  valid  results. 


3.3  Simulation  Parameters  and  Techniques 

The  emulator  drives  the  simulator  that  models  the  processor  and  the  DRP  unit  to  determine 
application  execution  time,  cache  performance,  bus  utilization,  and  processor  stall  overhead  due 
to  precollect  instructions.  The  simulation  latencies  used  are  those  of  a  Hewlett-Packard  PA-RISC 
7100  microprocessor,  as  given  in  Table  1. 

The  processor  model  includes  separate  instruction  and  data  caches  that  are  direct-mapped, 
8k-byte  blocking  caches  with  a  16-byte  block  size.  In  order  to  increase  the  accuracy  of  the  results, 
the  small  data  cache  size  was  chosen  to  match  the  application  data  set  size,  which  is  reduced 
from  the  data  set  size  of  the  actual  applications.  The  data  cache  is  a  multiported,  write-back,  no 
write-allocate  cache  that  satisfies  up  to  four  load  or  store  requests  per  cycle  from  the  processor. 
Any  single  load  miss  blocks  the  processor,  but  up  to  four  load  misses  (one  per  cache  port)  can  be 
outstanding  simultaneously,  because  the  data  fetches  are  pipelined  on  the  split-transaction  memory 
bus.  An  8-entry  write  buffer  combines  write  requests  to  the  same  cache  block.  The  instruction 
cache  and  data  cache  share  a  common,  split- transaction  memory  bus,  with  a  8  bytes/cycle  data 
bandwidth.  A  pipelined  memory  model  is  used  with  a  10-cycle  latency.  The  latency  of  a  cache 
block  fetch  from  memory  is  13  cycles,  one  for  the  request,  ten  for  the  memory  latency,  and  two 
for  the  return  of  two  8-byte  quantums  of  data.  However,  since  the  8-byte  quantum  containing  the 
requested  data  is  fetched  from  memory  before  the  other  8-byte  quantum  in  the  cache  block,  the 
load  miss  penalty  is  actually  12  cycles. 

A  direct-mapped  branch  target  buffer  with  1024  entries  is  used  to  perform  dynamic  branch 
prediction  using  a  2-bit  counter.  Hardware  speculation  is  supported,  and  the  branch  misprediction 
penalty  is  approximately  two  cycles. 

The  simulation  model  for  the  DRP  unit  is  based  on  the  hardware  description  in  Section  2.4.  For 
instance,  the  number  of  queue  and  buffer  entries  are  chosen  as  indicated  in  the  hardware  description 
so  that  no  blocking  occurs  in  the  DRP  unit  due  to  insufficient  entries. 

Since  simulating  the  entire  applications  at  this  level  of  detail  would  be  impractical,  uniform 
sampling  is  used  to  reduce  simulation  time  [23],  however  emulation  is  still  performed  between 
samples.  The  samples  are  200,000  instructions  in  length  and  are  spaced  evenly  every  20,000,000 
instructions,  yielding  a  1%  sampling  ratio.  Most  of  the  applications  used  have  more  than  a  bil¬ 
lion  dynamic  instructions,  at  least  50  samples,  and  thus,  more  than  10,000,000  instructions  are 
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a)  Number  of  DHP-transformed  loop  nests  for  each  applica- 

tion,  and  percentage  of  original  code  execution  time  for  all  b)  Original  code  loop  nest  execution  time  percentages  for 

transformed  loop  nests  per  application.  important  DHP-transformed  loop  nests. 


Table  2:  DRP  transformation  statistics. 


simulated.  For  smaller  applications,  the  time  between  samples  is  reduced  to  maintain  at  least 
50  samples  (10,000,000  instructions).  Prom  experience  with  the  emulation-driven  simulator,  we 
have  determined  that  sampling  with  at  least  50  samples  introduces  typically  less  than  1%  error  in 
generated  performance  statistics. 

4  Experimental  Evaluation  and  Analysis 

In  order  to  show  the  full  performance  benefit  of  the  DRP  technique,  experimental  results  are 
presented  for  all  application  loop  nests  which  are  transformed  by  the  DRP  technique  .  Results  for 
the  entire  applications  are  also  presented. 

4.1  Individual  Loop  Nest  Results 

Performance  statistics  for  individual  loop  nests  are  obtained  by  marking  the  Z)RP-transformed 
loop  nests  as  regions  for  gathering  simulation  statistics.  Table  2a)  lists  the  total  number  of  loop 
nests  that  were  DPP-transformed,  as  well  as  the  percentage  of  total  original  code  execution  time 
they  represent,  for  each  application.  Table  2b)  shows  the  original  code  execution  time  for  each 
of  the  most  important  loop  nests  transformed  by  the  DRP  technique  as  a  percentage  of  the  total 
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original  code  execution  time.  The  execution  time  percentages  are  determined  through  simulation, 
and  therefore,  include  all  memory  system  and  processor  model  effects  discussed  in  Section  3.3.  The 
transformed  loop  nests  are  identified  by  function  name,  the  Fortran  outer  DO-loop  number,  and 
the  loop  iteration  variable  if  necessary.  The  most  important  loop  nests  are  those  which  account  for 
over  5%  of  the  application  execution  time,  with  the  exception  of  ADM.  ADM  has  no  transformed 
loop  nests  which  account  for  over  5%  of  the  application  execution  time,  so  the  only  loop  nest  which 
accounts  for  over  1%  of  the  execution  time  is  shown.  As  shown  in  the  table,  the  total  execution 
percentage  of  the  transformed  loop  nests  for  each  application  varies  widely  from  3.1%  to  97.6%. 

Figure  7  presents  the  measured  speedup  of  the  DRP  technique  which  is  calculated  by  dividing 
the  original  loop  nest  execution  time  by  the  DJ?P-transformed  loop  nest  execution  time.  Measured 
speedup  for  most  loop  nests  is  relatively  large,  demonstrating  the  high  performance  improvement 
obtainable  using  the  DRP  technique.  The  smallest  speedup  in  execution  time  obtained  for  any 
important  loop  nest  is  1.21. 

The  speedup  obtained  using  the  DRP  technique  is  affected  by  several  factors.  First,  the  number 
of  cache  misses  decreases  with  the  application  of  the  DRP  technique,  which  will  be  quantified  using 
cache  miss  rates  from  simulation.  Second,  the  increase  in  bus  contention  caused  by  extra  DRP 
accesses  can  reduce  speedup.  Bus  utilization  data  will  be  shown  to  quantify  the  effect  of  bus 
contention  on  speedup.  Finally,  the  processor  stalls  due  to  unfinished  precollect  operations  can 
cause  severe  degradation  of  execution  time,  which  will  be  quantified  using  simulation  data.  The 
effect  of  stalls  due  to  insufficient  entries  in  the  DRP  queues  and  buffers  is  not  evaluated  in  this 
paper  due  to  space  constraints. 

Figure  7  shows  the  data  cache  miss  rate  for  the  original  code  and  the  DPP- transformed  code 
for  the  loop  nests.  Since  a  no  write-allocate  cache  is  used  for  these  experiments,  the  miss  ratios  for 
both  original  and  transformed  code  are  calculated  by  dividing  the  number  of  cache  read  misses  by 
the  number  of  cache  read  requests  in  the  original  code.  This  method  of  calculating  the  cache  misses 
assures  a  fair  comparison  if  the  number  of  cache  accesses  for  the  transformed  code  is  different  from 
the  number  for  the  original  code,  which  may  occur  due  to  DRP  compiler  transformations. 

Note  that  cache  misses  are  nearly  eliminated  for  most  of  the  loop  nests.  Since  the  compiler  relo¬ 
cates  nearly  all  array  data  read  in  the  loop  nest,  there  are  few  possibilities  for  capacity  and  conflict 
cache  misses  to  occur.  Furthermore,  prefetching  the  array  data  eliminates  the  compulsory  misses. 
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Figure  7;  Speedup  of  the  DiZP-transformed  code  over  the  original  code,  and  cache  miss  rate  of  the 
PPP-transformed  code  and  the  original  code,  for  loop  nests. 
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Figure  8:  Bus  utilization  and  Precollect  stall  overhead  for  the  PPP- transformed  code  and  the 
original  code  for  loop  nests. 


Scalar  variables  accessed  in  the  loop  nest  are  not  relocated  and  may  conflict  with  relocated  array 
data,  causing  cache  misses.  For  instance,  the  STEPFY.^SO  loop  nest  in  the  ARC2D  application 
contains  inner-loop  scalar  accesses  which  contribute  to  the  non-zero  cache  miss  rate.  In  the  loop 
nest  ACTFOR.350  in  BDNA,  only  half  the  array  reads  are  relocated  due  to  data  dependences  and 
complicated  subscript  expressions,  which  results  in  a  relatively  large  percentage  of  remaining  cache 
misses.  Most  of  the  remaining  cache  misses  are  due  to  additional  memory  accesses  introduced  by 
register  spill  code. 

Figure  8  displays  the  percentage  of  execution  cycles  for  which  the  processor  is  stalled  waiting 
for  a  precollect  to  complete  in  the  transformed  loop  nests,  and  the  memory  bus  utilization  for 
the  transformed  and  original  loop  nests.  For  the  loop  nests  with  small  stall  overhead  (less  than 
5%),  the  stalls  are  due  almost  entirely  to  the  software-pipeline  startup  overhead  in  the  first  outer 
loop  iteration.  The  DRP  unit  utilizes  the  unused  bus  cycles  because  it  has  lower  priority  than 
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Figure  9:  Data  cache  miss  ratio  for  original  code  and  DiJP-transformed  code,  and  speedup  of  the 
DPP-transformed  code  over  the  original  code. 

the  cache  when  accessing  the  bus.  However,  the  DRP  transformation  does  not  cause  more  data 
to  be  brought  to  the  CPU,  but  instead  overlaps  the  data  fetches  to  take  full  advantage  of  the 
bus  bandwidth.  Consequently,  bus  utilization  increases  because  the  program  execution  is  sped  up. 
Since  the  application  programs  tend  to  be  limited  by  data  access  rather  than  computation,  large 
stall  overhead  indicates  that  that  no  more  performance  can  be  gained  unless  the  bus  bandwidth  is 
increased. 


4.2  Entire  Benchmark  Results 

The  data  cache  read  miss  ratios  show  promising  improvement  for  several  applications  shown  in 
Figure  9.  However,  applications  ADM  and  OCEAN  show  little  improvement  in  cache  performance 
since  only  a  small  percentage  of  the  application  execution  is  transformed  by  the  DRP  technique.  In 
addition,  application  BDNA  contains  a  single  transformed  loop  with  only  half  the  array  references 
relocated,  so  conflict  misses  are  not  effectively  reduced.  However,  the  cache  miss  reductions  are  large 
compared  to  the  transformed  loop  nest  execution  percentage  for  ARC2D  and  NASA7  indicating 
that  the  loop  nests  with  the  worst  cache  behavior  were  indeed  transformed. 

Speedups  for  the  simulated  execution -of  the  Di?P-transformed  code  over  the  original  code  for 
the  six  applications  are  shown  in  Figure  9.  The  total  speedups  for  ARC2D,  MATRIX300,  NASA? 
and  TOMCATV  are  high,  although  the  rest  of  the  applications  show  much  smaller  gains  in  total 
performance.  For  ADM  and  OCEAN  the  small  speedup  is  attributable  to  the  fact  that  percentage 
of  the  execution  time  spent  in  the  transformed  loop  nests  is  relatively  small,  (3.1%  and  20.2% 
respectively).  In  the  case  of  BDNA,  a  single  loop  nest  dominates  the  entire  execution  time.  As 
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Figure  10:  Speedup  comparison  of  the  DRP  technique  with  8K-byte  direct-mapped  caches  to 
original  code  with  caches  of  larger  capacity  and  set-associativity. 

mentioned  earlier,  only  half  of  the  array  references  in  the  loop  nest  were  transformed  for  DRP, 
resulting  in  the  small  speedup. 

In  order  to  evaluate  the  effectiveness  of  the  DRP  technique  as  opposed  to  just  increasing  the 
cache  size  or  associativity,  we  simulated  the  execution  of  the  original  code  using  various  cache  design 
parameters.  Figure  10  shows  the  application  execution  speedup  for  the  DRP  technique  using  an 
8K-byte  direct-mapped  (DM)  cache  plotted  with  the  speedup  for  16K,  32K,  and  64K-byte  DM 
caches  as  well  as  for  2-way,  4-way  and  8-way  set-associative  8K-byte  caches.  All  speedups  shown 
are  with  respect  to  the  execution  of  the  original  code  using  an  8K-byte  DM  cache. 

For  ARC2D,  MATRIX300,  NASA?,  and  TOMCATV,  the  DRP  technique  with  an  8K-byte 
cache  outperforms  caches  with  up  to  64K-byte  capacity  by  a  large  margin.  DRP  still  outperforms 
up  to  32K-byte  caches  for  BDNA  and  OCEAN,  even  though  the  application  of  DRP  is  limited.  For 
ADM,  however,  DRP  is  ineffective  in  reducing  the  execution  time,  so  larger  caches  are  the  clear 
winner. 

The  DRP  technique  is  much  more  effective  than  increasing  the  set  associativity  for  all  the 
applications  but  ADM.  Since  most  of  the  cache  misses  are  due  to  limited  cache  capacity  rather 
than  mapping  conflicts,  increasing  the  associativity  is  of  limited  benefit. 

5  Conclusions 

5.1  Summary 

An  architectural  extension,  referred  to  as  data  relocation  and  prefetching,  is  proposed  to  perform 
data  relocation  and  compression  during  prefetching.  Data  relocation  is  employed  to  remove  array- 
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data  mapping  conflicts  by  compressing  the  accesses  in  a  loop  nest  into  sequential  locations  in  the 
cache.  Compression  also  improves  utilization  of  the  cache  by  transforming  non-unit  stride  and  array 
column  accesses  into  sequential  accesses  that  require  fewer  cache  lines  for  storage.  Furthermore, 
reduction  of  the  cache  space  used  to  hold  the  data  in  the  loop  nest  can  increase  the  data  reuse 
across  transformed  loop  nests. 

The  hardware  design  meets  the  objectives  of  asynchronous  operation  of  the  CPU  and  DRP  unit 
by  allowing  overlap  of  program  execution  and  DRP,  high  utilization  of  memory  bus  bandwidth, 
and  support  for  out-of-order  return  of  data  from  the  memory  system.  The  Precollect  Status  Store 
and  Instruction  Queue  components  realize  the  first  objective,  the  Outstanding  Memory  Request 
Queue  realizes  the  second,  and  the  Block  Assembly  Cache  realizes  the  third. 

By  combining  the  data  relocation  and  prefetching  hardware  with  supporting  compiler  transfor¬ 
mations,  the  performance  of  loop  nests  is  greatly  improved  for  a  set  of  array-based  benchmarks. 
Furthermore,  we  have  shown  that  application  of  the  data  relocation  and  prefetching  technique 
greatly  improves  the  cache  performance.  Finally,  the  DRP  technique  compares  very  favorably  to 
increasing  the  cache  capacity  or  set-associativity  in  terms  of  application  execution  speedup. 

5.2  Future  Research 

For  the  programs  that  contain  many  transformed  arrays,  some  technique  is  necessary  to  reduce  the 
bus  traffic.  Use  of  separate  request  and  return  data  busses  could  be  investigated  for  this  purpose, 
as  well  as  some  technique  to  combine  multiple  bus  accesses  into  a  single  access.  In  addition,  the 
compiler  transformations  can  be  expanded  and  improved  in  order  to  transform  more  loop  nests 
effectively.  Further  experiments  are  warranted  to  study  the  performance  of  this  technique  by 
varying  implementation  parameters  for  the  DRP  hardware. 
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a)  Original  Loop  Nest 

for  (i*0:  i<N;  i++) 
for  (j*0;  j<N;  j++)  { 

C[i]Cj]  =  ACi]C2*j3  +  BCjDCi]; 
DCi]Cj]  »  C[i]Cj]  -  BCj+i][i]: 

> 


b)  After  Loop  Unrolling 

for  (i’*0;  i’<K;  i'+*2)  { 

/*  first  outsr  loop  body  «/ 
i  =  i’ ; 

for  (j*0:  j<N;  j++) 

CCi3Cj]  »  ACi]C2*j]  +  BCj]Ci]; 
DCi][j]  -  CCi][j]  -  BCj  +  l3Ci]; 

/«  second  outer  loop  body  •/ 

i++; 

if  (i  <  N)  { 

for  (j«0;  j<N:  j++) 

CCi]Cj]  »  A[i]C2*j]  +  BCj]Ci]; 
DCilCj]  *  C[i][j]  -  BCj+l]Ci]; 

} 


c)  After  Transformations 

/*  Bd  •  dimensionCBD)  •  8  bytes/element  ' 

/•  prologue:  for  first  outer  loop  body  •/ 
precollect(AA[0] [0] ,  A’,  8,  16,  N) ; 
precollect (tBCO]  Co]  ,  B’,  8,  Bd,  N-^1); 

for  (i'*0;  i’<M;  i’+*2)  { 

/•  first  outer  loop  body  •/ 
i  »  i»; 

if  (i+1  <  N)  { 

/•  for  second  outer  loop  body  */ 
precollect  (AACi'*'!]  CO]  ,A’’,8,16,N); 
precollect (ABCO] Ci+l3 ,B’ ' ,8,Bd,N+l) ; 

} 

for  (j*0;  j<N;  j++)  i. 

CCilCj]  »  A«Cj]  +  B'Cj]; 

D[i][j]  -  cCiDCj]  -  B>Cj+i]; 

> 

/«  second  outer  loop  body  */ 
i++; 

if  (i  <  N)  < 
if  (i+1  <  N)  •{ 

/•  for  first  outer  loop  body  •/ 
precollect (AA [i+1] [0] , A ' ,8 , 16 ,N) ; 
precollect (AB Co] Ci+1] >B' ,8,Bd,N+l) ; 

} 

for  (j*0;  j<N;  j++)  { 

CCUCj]  «  A"Cj]  +  B-Cj3; 

DCilCj]  -  CCilCj]  -  B"Cj  +  lh 

} 

} 

> 


Figure  11:  Transformation  Example  for  DRP 


A  Compiler  Transformations 

A.l  Loop  Strip-mining 

For  the  DRP  technique,  strip-mining  of  the  inner  loop  is  performed  in  order  to  create  a  doubly  nested  loop  if  the  loop 
nest  consists  of  only  a  singly  nested  loop.  This  is  most  beneficial  for  singly  nested  loops  that  issue  strided  references 
to  array  elements.  Strip-mining  is  also  used  in  order  to  reduce  the  amount  of  data  relocated  for  the  inner-loop 
computation  if  that  amount  is  too  large  to  fit  in  the  cache.  This  is  necessary  since  all  elements  that  are  relocated  and 
prefetched  for  the  inner  loop  computation  must  be  allocated  to  unique  cache  locations  to  prevent  possible  conflict 
misses. 

A. 2  Declaration  of  New  Variables  for  DRP 

Next,  the  buffer  space  is  declared  as  an  array  that  has  the  same  size  as  the  cache  size.  This  guarantees  that  the 
relocated  arrays  in  the  buffer  space  have  no  cache  line  conflicts  among  them. 

A. 3  Loop  Unrolling 

In  order  to  overlap  the  data  relocation  and  prefetching  for  the  next  outer-loop  iteration  with  the  computation  for  the 
current  iteration,  the  relocation  and  prefetching  phase  is  software-pipelined  with  the  computation  phase.  The  original 
loop  nest  of  Figure  11a)  is  shown  after  loop  unrolling  in  Figure  11b).  This  software-pipelining  scheme  requires  two 
relocation  buffers.  The  inner  loop  is  duplicated  by  unrolling  the  outer  loop  once.  In  the  first  outer-loop  body,  the 
data  relocation  proceeds  into  the  second  relocation  buffer,  while  the  computation  is  performed  using  the  data  already 
relocated  in  the  first  buffer.  For  the  second  outer-loop  body,  the  same  method  is  used  as  for  the  first  outer-loOp  body 
except  that  the  relocation  buffers  are  switched. 
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A. 4  Insertion  of  Precollect  Operations 

After  creating  a  doubly-nested  loop  by  strip-mining,  precollect  operations  are  inserted  in  the  high-level  code.  These 
high-level  operations  are  replaced  by  the  corresponding  machine  instructions  at  the  assembly  code  level. 

Before  the  outer  loop,  precollect  operations  are  inserted  for  the  first  inner  loop  computation.  Since  these  operations 
cannot  be  overlapped  with  any  other  computation  in  the  loop  body,  these  operations  constitute  the  start-up  overhead 
of  software  pipelining.  Also,  precollect  operations  for  each  inner  loop  within  the  unrolled  outer  loop  axe  inserted  before 
the  inner  loop  in  which  they  are  used.  Each  inner  loop  needs  to  use  a  different  relocation  buffer  since  precollecting 
the  data  for  the  next  loop  is  overlapped  with  the  computation  for  the  current  loop  by  software  pipelining. 

A. 5  Replacement  of  Array  References  with  Relocation  Buffer  References 

Once  the  DRP  operations  have  been  inserted,  the  array  references  for  the  computation  within  the  inner  loop  are 
modified  so  that  the  relocation  buffer  locations  are  accessed  instead  of  the  original  array  locations.  Some  array 
references  that  do  not  need  to  be  relocated  are  left  as  the  original  array  references.  Replaced  array  references  are 
all  one-dimensional  even  if  the  original  curray  references  are  multi-dimensional.  If  more  than  one  reference  within  a 
loop  nest  is  to  the  same  array,  the  reference  patterns  are  analyzed  to  generate  the  minimum  number  of  precollects 
necessary.  For  example,  references  B[j-(-l][i]  and  B[j][i]  in  Figure  11  are  replaced  by  references  to  a  single  relocation 
buffer  B’  (in  the  first  outer  loop  body),  since  the  two  references  overlap  in  most  elements  they  access.  We  only  need 
to  precollect  B[j][i]  for  j  from  0  to  N-1-1.  Therefore  relocation  buffer  B’  contains  N-(-l  elements. 

The  final  transformed  code  of  the  example  in  Figure  11a)  is  shown  in  Figure  11c). 


